{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 从头训练tokenizer的示例\n",
    "\n",
    "该notebook提供一个完整的使用[sentencepiece](https://github.com/google/sentencepiece)工具，训练一个BPE算法分词的示例。\n",
    "\n",
    "SentencePiece是一种基于未经分词的文本进行训练的通用分词工具，由Google开发。它可以用于多种语言，并支持多种分词算法，如BPE（Byte Pair Encoding）和Unigram Language Model。SentencePiece的主要特点和功能如下：\n",
    "\n",
    "1. 通用性：SentencePiece可以适用于多种语言，包括英语、中文、日语、韩语等。这使得它成为处理多语言文本和跨语言任务的理想选择。\n",
    "\n",
    "1. 未经分词的训练：与传统的分词工具不同，SentencePiece的训练过程不需要经过预分词的文本作为输入。它可以直接训练原始的未经分词的文本，从而更好地捕捉语言的特征和模式。\n",
    "\n",
    "1. 分词算法支持：SentencePiece支持多种分词算法，其中最常用的是BPE和Unigram Language Model。BPE算法逐步合并出现频率最高的字符对，而Unigram Language Model则通过学习每个子词的概率分布来进行分词。\n",
    "\n",
    "1. 灵活的训练选项：SentencePiece提供了丰富的训练选项，可以控制词汇表大小、分词算法参数等。这使得用户可以根据具体任务和需求进行定制化的分词训练。\n",
    "\n",
    "1. 易用的API：SentencePiece提供了多种编程语言的API接口，包括Python、C++、Java等。这使得开发人员可以方便地在各种NLP框架和应用中使用SentencePiece。\n",
    "\n",
    "1. 预训练模型和共享模型：SentencePiece提供了一些预训练的模型，可以用于快速开始分词任务。此外，用户还可以共享和下载其他用户训练好的模型，节省了训练时间和资源。\n",
    "\n",
    "使用SentencePiece进行分词的一般流程如下：\n",
    "\n",
    "1. 准备训练数据：收集并准备用于训练的原始文本数据，可以是未经分词的文本。\n",
    "\n",
    "1. 训练模型：使用SentencePiece提供的API，将原始文本数据作为输入进行模型训练。可以选择合适的分词算法和训练参数。\n",
    "\n",
    "1. 应用模型：将训练好的模型应用于实际的分词任务。可以将模型加载到相关的NLP应用中，或使用SentencePiece提供的API进行分词操作。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 安装库和数据准备\n",
    "\n",
    "我们在这个示例中使用一个小的训练语料(红楼梦.txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#仔细检查下自己的运行目录，相对路径需要注意下。\n",
    "!pwd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2023-08-18 16:12:28--  https://github.com/shjwudp/shu/raw/master/books/%E7%BA%A2%E6%A5%BC%E6%A2%A6.txt\n",
      "Resolving github.com (github.com)... 20.205.243.166\n",
      "Connecting to github.com (github.com)|20.205.243.166|:443... connected.\n",
      "HTTP request sent, awaiting response... No data received.\n",
      "Retrying.\n",
      "\n",
      "--2023-08-18 16:14:30--  (try: 2)  https://github.com/shjwudp/shu/raw/master/books/%E7%BA%A2%E6%A5%BC%E6%A2%A6.txt\n",
      "Connecting to github.com (github.com)|20.205.243.166|:443... connected.\n",
      "HTTP request sent, awaiting response... No data received.\n",
      "Retrying.\n",
      "\n",
      "--2023-08-18 16:16:32--  (try: 3)  https://github.com/shjwudp/shu/raw/master/books/%E7%BA%A2%E6%A5%BC%E6%A2%A6.txt\n",
      "Connecting to github.com (github.com)|20.205.243.166|:443... connected.\n",
      "HTTP request sent, awaiting response... 302 Found\n",
      "Location: https://raw.githubusercontent.com/shjwudp/shu/master/books/%E7%BA%A2%E6%A5%BC%E6%A2%A6.txt [following]\n",
      "--2023-08-18 16:17:39--  https://raw.githubusercontent.com/shjwudp/shu/master/books/%E7%BA%A2%E6%A5%BC%E6%A2%A6.txt\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 2622979 (2.5M) [text/plain]\n",
      "Saving to: ‘../data/红楼梦.txt’\n",
      "\n",
      "红楼梦.txt          100%[===================>]   2.50M  42.6KB/s    in 4m 20s  \n",
      "\n",
      "2023-08-18 16:22:01 (9.85 KB/s) - ‘../data/红楼梦.txt’ saved [2622979/2622979]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# ! pip install sentencepiece\n",
    "\n",
    "!wget https://github.com/shjwudp/shu/raw/master/books/红楼梦.txt  -P ../data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 端到端示例"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['▁', '满', '纸', '荒', '唐', '言', ',', '一把', '辛', '酸', '泪', '!', '都', '云', '作', '者', '痴', ',', '谁', '解', '其', '中', '味', '?']\n",
      "[29, 368, 711, 2140, 2215, 244, 3, 1725, 3315, 1797, 477, 45, 39, 389, 86, 354, 928, 3, 182, 493, 261, 87, 1509, 41]\n",
      "满纸荒唐言,一把辛酸泪\n",
      "满纸荒唐言,一把辛酸泪\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=../data/红楼梦.txt --model_prefix=meng --vocab_size=5000\n",
      "sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : \n",
      "trainer_spec {\n",
      "  input: ../data/红楼梦.txt\n",
      "  input_format: \n",
      "  model_prefix: meng\n",
      "  model_type: UNIGRAM\n",
      "  vocab_size: 5000\n",
      "  self_test_sample_size: 0\n",
      "  character_coverage: 0.9995\n",
      "  input_sentence_size: 0\n",
      "  shuffle_input_sentence: 1\n",
      "  seed_sentencepiece_size: 1000000\n",
      "  shrinking_factor: 0.75\n",
      "  max_sentence_length: 4192\n",
      "  num_threads: 16\n",
      "  num_sub_iterations: 2\n",
      "  max_sentencepiece_length: 16\n",
      "  split_by_unicode_script: 1\n",
      "  split_by_number: 1\n",
      "  split_by_whitespace: 1\n",
      "  split_digits: 0\n",
      "  treat_whitespace_as_suffix: 0\n",
      "  allow_whitespace_only_pieces: 0\n",
      "  required_chars: \n",
      "  byte_fallback: 0\n",
      "  vocabulary_output_piece_score: 1\n",
      "  train_extremely_large_corpus: 0\n",
      "  hard_vocab_limit: 1\n",
      "  use_all_vocab: 0\n",
      "  unk_id: 0\n",
      "  bos_id: 1\n",
      "  eos_id: 2\n",
      "  pad_id: -1\n",
      "  unk_piece: <unk>\n",
      "  bos_piece: <s>\n",
      "  eos_piece: </s>\n",
      "  pad_piece: <pad>\n",
      "  unk_surface:  ⁇ \n",
      "}\n",
      "normalizer_spec {\n",
      "  name: nmt_nfkc\n",
      "  add_dummy_prefix: 1\n",
      "  remove_extra_whitespaces: 1\n",
      "  escape_whitespaces: 1\n",
      "  normalization_rule_tsv: \n",
      "}\n",
      "denormalizer_spec {}\n",
      "trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.\n",
      "trainer_interface.cc(178) LOG(INFO) Loading corpus: ../data/红楼梦.txt\n",
      "trainer_interface.cc(356) LOG(WARNING) Found too long line (4224 > 4192).\n",
      "trainer_interface.cc(358) LOG(WARNING) Too long lines are skipped in the training.\n",
      "trainer_interface.cc(359) LOG(WARNING) The maximum length can be changed with --max_sentence_length=<size> flag.\n",
      "trainer_interface.cc(385) LOG(INFO) Loaded all 3144 sentences\n",
      "trainer_interface.cc(391) LOG(INFO) Skipped 6 too long sentences.\n",
      "trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <unk>\n",
      "trainer_interface.cc(400) LOG(INFO) Adding meta_piece: <s>\n",
      "trainer_interface.cc(400) LOG(INFO) Adding meta_piece: </s>\n",
      "trainer_interface.cc(405) LOG(INFO) Normalizing sentences...\n",
      "trainer_interface.cc(466) LOG(INFO) all chars count=866703\n",
      "trainer_interface.cc(477) LOG(INFO) Done: 99.95% characters are covered.\n",
      "trainer_interface.cc(487) LOG(INFO) Alphabet size=3986\n",
      "trainer_interface.cc(488) LOG(INFO) Final character coverage=0.9995\n",
      "trainer_interface.cc(520) LOG(INFO) Done! preprocessed 3144 sentences.\n",
      "unigram_model_trainer.cc(139) LOG(INFO) Making suffix array...\n",
      "unigram_model_trainer.cc(143) LOG(INFO) Extracting frequent sub strings...\n",
      "unigram_model_trainer.cc(194) LOG(INFO) Initialized 118693 seed sentencepieces\n",
      "trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 3144\n",
      "trainer_interface.cc(537) LOG(INFO) Done! 3395\n",
      "unigram_model_trainer.cc(489) LOG(INFO) Using 3395 sentences for EM training\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=73905 obj=1142.99 num_tokens=438712 num_tokens/piece=5.93616\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=67679 obj=1080.79 num_tokens=440306 num_tokens/piece=6.5058\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=50670 obj=1098.66 num_tokens=455862 num_tokens/piece=8.99668\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=50489 obj=1091.95 num_tokens=456012 num_tokens/piece=9.03191\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=37857 obj=1116.71 num_tokens=475366 num_tokens/piece=12.5569\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=37851 obj=1109.43 num_tokens=475504 num_tokens/piece=12.5625\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=28388 obj=1136.25 num_tokens=494998 num_tokens/piece=17.4369\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=28386 obj=1128.99 num_tokens=495035 num_tokens/piece=17.4394\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=21289 obj=1158.43 num_tokens=514125 num_tokens/piece=24.1498\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=21288 obj=1151.71 num_tokens=514238 num_tokens/piece=24.1562\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=15966 obj=1181.42 num_tokens=534415 num_tokens/piece=33.4721\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=15966 obj=1174.79 num_tokens=534461 num_tokens/piece=33.4749\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=11974 obj=1207.47 num_tokens=555581 num_tokens/piece=46.3989\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=11974 obj=1200.8 num_tokens=555619 num_tokens/piece=46.4021\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=8980 obj=1236.37 num_tokens=579214 num_tokens/piece=64.5004\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=8980 obj=1228.95 num_tokens=579220 num_tokens/piece=64.5011\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=6735 obj=1271.78 num_tokens=607862 num_tokens/piece=90.2542\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=6735 obj=1262.23 num_tokens=607865 num_tokens/piece=90.2546\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=5500 obj=1299.97 num_tokens=635264 num_tokens/piece=115.503\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=5500 obj=1290.84 num_tokens=635270 num_tokens/piece=115.504\n",
      "trainer_interface.cc(615) LOG(INFO) Saving model: meng.model\n",
      "trainer_interface.cc(626) LOG(INFO) Saving vocabs: meng.vocab\n"
     ]
    }
   ],
   "source": [
    "import sentencepiece as spm\n",
    "\n",
    "# 用红楼梦.txt训练一个sentencepiece模型，模型前缀model_prefix=meng, 会生成meng.model, meng.vocab.\n",
    "# meng.vocab仅仅是一个参考，在分词中并未使用。\n",
    "spm.SentencePieceTrainer.train('--input=../data/红楼梦.txt --model_prefix=meng --vocab_size=5000')\n",
    "\n",
    "# 实例化一个分词实例，然后加载训练好的meng.model\n",
    "sp = spm.SentencePieceProcessor()\n",
    "sp.load('meng.model')\n",
    "\n",
    "# encode: text => id\n",
    "print(sp.encode_as_pieces('满纸荒唐言，一把辛酸泪！都云作者痴，谁解其中味？'))\n",
    "print(sp.encode_as_ids('满纸荒唐言，一把辛酸泪！都云作者痴，谁解其中味？'))\n",
    "\n",
    "# decode: id => text\n",
    "print(sp.decode_pieces(['满', '纸', '荒', '唐', '言', ',', '一把', '辛', '酸', '泪']))\n",
    "print(sp.decode_ids([368, 711, 2140, 2215, 244, 3, 1725, 3315, 1797, 477]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "词表大小=5000\n",
      "[716]\n",
      "▁宝玉\n",
      "22\n",
      "0\n",
      "<unk> False\n",
      "<s> True\n",
      "</s> True\n"
     ]
    }
   ],
   "source": [
    "# 返回 vocab size\n",
    "print(f\"词表大小={sp.get_piece_size()}\")\n",
    "\n",
    "print(sp.encode_as_ids(\"宝玉\"))\n",
    "# id <=> piece conversion\n",
    "print(sp.id_to_piece(716))\n",
    "print(sp.piece_to_id('宝玉'))\n",
    "\n",
    "# id=0的位置留着给UNK token, 可对其进行修改\n",
    "print(sp.piece_to_id('__MUST_BE_UNKNOWN__'))\n",
    "\n",
    "# 控制符 unk, <s>, </s> 默认id对应（0,1,2）\n",
    "for id in range(3):\n",
    "      print(sp.id_to_piece(id), sp.is_control(id))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded SentencePiece model from llama2enzh/tokenizer.model\n",
      "#words: 55296 - BOS ID: 1 - EOS ID: 2 - PAD ID: -1 - UNK ID : 0\n",
      "Loaded SentencePiece model from meng.model\n",
      "#words: 5000 - BOS ID: 1 - EOS ID: 2 - PAD ID: -1 - UNK ID : 0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.\n"
     ]
    }
   ],
   "source": [
    "# 加载一个社区训练好的tokenizer对比下。\n",
    "\n",
    "from sentencepiece import SentencePieceProcessor\n",
    "model_path = \"llama2enzh/tokenizer.model\"\n",
    "sp_model = SentencePieceProcessor(model_file=model_path)\n",
    "print(f\"Loaded SentencePiece model from {model_path}\")\n",
    "\n",
    "# BOS / EOS token IDs\n",
    "n_words: int = sp_model.vocab_size()\n",
    "bos_id: int = sp_model.bos_id()\n",
    "eos_id: int = sp_model.eos_id()\n",
    "pad_id: int = sp_model.pad_id()\n",
    "unk_id: int = sp_model.unk_id()\n",
    "print(f\"#words: {n_words} - BOS ID: {bos_id} - EOS ID: {eos_id} - PAD ID: {pad_id} - UNK ID : {unk_id}\")\n",
    "\n",
    "\n",
    "model_path = \"meng.model\"\n",
    "sp_model = SentencePieceProcessor(model_file=model_path)\n",
    "print(f\"Loaded SentencePiece model from {model_path}\")\n",
    "\n",
    "# BOS / EOS token IDs\n",
    "n_words: int = sp_model.vocab_size()\n",
    "bos_id: int = sp_model.bos_id()\n",
    "eos_id: int = sp_model.eos_id()\n",
    "pad_id: int = sp_model.pad_id()\n",
    "print(f\"#words: {n_words} - BOS ID: {bos_id} - EOS ID: {eos_id} - PAD ID: {pad_id} - UNK ID : {unk_id}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 修改这些控制符的位置\n",
    "\n",
    "\n",
    "默认情况, UNK/BOS/EOS/PAD 这些token的是按照如下定义的:\n",
    "\n",
    "|token|UNK|BOS|EOS|PAD|\n",
    "---|---\n",
    "|surface|&lt;unk&gt;|&lt;s&gt;|&lt;/s&gt;|&lt;pad&gt;|\n",
    "|id|0|1|2|undefined (-1)|\n",
    "\n",
    "\n",
    "我们可以通过这些参数对齐修改 **--{unk|bos|eos|pad}_id** and **--{unk|bos|eos|pad}_piece** flags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[PAD] True\n",
      "[UNK] False\n",
      "[BOS] True\n",
      "[EOS] True\n",
      "#words: 5000 - BOS ID: 2 - EOS ID: 3 - PAD ID: 0 - UNK ID : 0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=../data/红楼梦.txt --model_prefix=meng --vocab_size=5000 --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]\n",
      "sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : \n",
      "trainer_spec {\n",
      "  input: ../data/红楼梦.txt\n",
      "  input_format: \n",
      "  model_prefix: meng\n",
      "  model_type: UNIGRAM\n",
      "  vocab_size: 5000\n",
      "  self_test_sample_size: 0\n",
      "  character_coverage: 0.9995\n",
      "  input_sentence_size: 0\n",
      "  shuffle_input_sentence: 1\n",
      "  seed_sentencepiece_size: 1000000\n",
      "  shrinking_factor: 0.75\n",
      "  max_sentence_length: 4192\n",
      "  num_threads: 16\n",
      "  num_sub_iterations: 2\n",
      "  max_sentencepiece_length: 16\n",
      "  split_by_unicode_script: 1\n",
      "  split_by_number: 1\n",
      "  split_by_whitespace: 1\n",
      "  split_digits: 0\n",
      "  treat_whitespace_as_suffix: 0\n",
      "  allow_whitespace_only_pieces: 0\n",
      "  required_chars: \n",
      "  byte_fallback: 0\n",
      "  vocabulary_output_piece_score: 1\n",
      "  train_extremely_large_corpus: 0\n",
      "  hard_vocab_limit: 1\n",
      "  use_all_vocab: 0\n",
      "  unk_id: 1\n",
      "  bos_id: 2\n",
      "  eos_id: 3\n",
      "  pad_id: 0\n",
      "  unk_piece: [UNK]\n",
      "  bos_piece: [BOS]\n",
      "  eos_piece: [EOS]\n",
      "  pad_piece: [PAD]\n",
      "  unk_surface:  ⁇ \n",
      "}\n",
      "normalizer_spec {\n",
      "  name: nmt_nfkc\n",
      "  add_dummy_prefix: 1\n",
      "  remove_extra_whitespaces: 1\n",
      "  escape_whitespaces: 1\n",
      "  normalization_rule_tsv: \n",
      "}\n",
      "denormalizer_spec {}\n",
      "trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.\n",
      "trainer_interface.cc(178) LOG(INFO) Loading corpus: ../data/红楼梦.txt\n",
      "trainer_interface.cc(356) LOG(WARNING) Found too long line (4224 > 4192).\n",
      "trainer_interface.cc(358) LOG(WARNING) Too long lines are skipped in the training.\n",
      "trainer_interface.cc(359) LOG(WARNING) The maximum length can be changed with --max_sentence_length=<size> flag.\n",
      "trainer_interface.cc(385) LOG(INFO) Loaded all 3144 sentences\n",
      "trainer_interface.cc(391) LOG(INFO) Skipped 6 too long sentences.\n",
      "trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [PAD]\n",
      "trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [UNK]\n",
      "trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [BOS]\n",
      "trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [EOS]\n",
      "trainer_interface.cc(405) LOG(INFO) Normalizing sentences...\n",
      "trainer_interface.cc(466) LOG(INFO) all chars count=866703\n",
      "trainer_interface.cc(477) LOG(INFO) Done: 99.95% characters are covered.\n",
      "trainer_interface.cc(487) LOG(INFO) Alphabet size=3986\n",
      "trainer_interface.cc(488) LOG(INFO) Final character coverage=0.9995\n",
      "trainer_interface.cc(520) LOG(INFO) Done! preprocessed 3144 sentences.\n",
      "unigram_model_trainer.cc(139) LOG(INFO) Making suffix array...\n",
      "unigram_model_trainer.cc(143) LOG(INFO) Extracting frequent sub strings...\n",
      "unigram_model_trainer.cc(194) LOG(INFO) Initialized 118693 seed sentencepieces\n",
      "trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 3144\n",
      "trainer_interface.cc(537) LOG(INFO) Done! 3395\n",
      "unigram_model_trainer.cc(489) LOG(INFO) Using 3395 sentences for EM training\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=73905 obj=1142.99 num_tokens=438712 num_tokens/piece=5.93616\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=67679 obj=1080.79 num_tokens=440306 num_tokens/piece=6.5058\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=50670 obj=1098.66 num_tokens=455862 num_tokens/piece=8.99668\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=50489 obj=1091.95 num_tokens=456012 num_tokens/piece=9.03191\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=37857 obj=1116.71 num_tokens=475366 num_tokens/piece=12.5569\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=37851 obj=1109.43 num_tokens=475504 num_tokens/piece=12.5625\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=28388 obj=1136.25 num_tokens=494998 num_tokens/piece=17.4369\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=28386 obj=1128.99 num_tokens=495035 num_tokens/piece=17.4394\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=21289 obj=1158.43 num_tokens=514125 num_tokens/piece=24.1498\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=21288 obj=1151.71 num_tokens=514238 num_tokens/piece=24.1562\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=15966 obj=1181.42 num_tokens=534415 num_tokens/piece=33.4721\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=15966 obj=1174.79 num_tokens=534461 num_tokens/piece=33.4749\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=11974 obj=1207.47 num_tokens=555581 num_tokens/piece=46.3989\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=11974 obj=1200.8 num_tokens=555619 num_tokens/piece=46.4021\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=8980 obj=1236.37 num_tokens=579214 num_tokens/piece=64.5004\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=8980 obj=1228.95 num_tokens=579220 num_tokens/piece=64.5011\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=6735 obj=1271.78 num_tokens=607862 num_tokens/piece=90.2542\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=6735 obj=1262.23 num_tokens=607865 num_tokens/piece=90.2546\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=0 size=5500 obj=1299.97 num_tokens=635264 num_tokens/piece=115.503\n",
      "unigram_model_trainer.cc(505) LOG(INFO) EM sub_iter=1 size=5500 obj=1290.84 num_tokens=635270 num_tokens/piece=115.504\n",
      "trainer_interface.cc(615) LOG(INFO) Saving model: meng.model\n",
      "trainer_interface.cc(626) LOG(INFO) Saving vocabs: meng.vocab\n"
     ]
    }
   ],
   "source": [
    "spm.SentencePieceTrainer.train('--input=../data/红楼梦.txt --model_prefix=meng --vocab_size=5000 --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 --pad_piece=[PAD] --unk_piece=[UNK] --bos_piece=[BOS] --eos_piece=[EOS]')\n",
    "sp = spm.SentencePieceProcessor()\n",
    "sp.load('meng.model')\n",
    "\n",
    "for id in range(4):\n",
    "    print(sp.id_to_piece(id), sp.is_control(id))\n",
    "\n",
    "n_words: int = sp.vocab_size()\n",
    "bos_id: int = sp.bos_id()\n",
    "eos_id: int = sp.eos_id()\n",
    "pad_id: int = sp.pad_id()\n",
    "print(f\"#words: {n_words} - BOS ID: {bos_id} - EOS ID: {eos_id} - PAD ID: {pad_id} - UNK ID : {unk_id}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### BPE (Byte pair encoding) model\n",
    "\n",
    "可通过 -model_type=bpe 指定model类型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "*** BPE ***\n",
      "['▁', '满', '纸', '荒', '唐', '言', ',', '一把', '辛', '酸', '泪']\n",
      "[]\n",
      "['▁', '满', '纸', '荒', '唐', '言', ',', '一把', '辛', '酸', '泪', '!', '都', '云', '作', '者', '痴', ',', '谁', '解', '其中', '味', '?']\n",
      "[6060, 6407, 6679, 7346, 7398, 6260, 6014, 983, 7319, 7195, 6353, 6071, 6073, 6222, 6146, 6319, 6793, 6014, 6185, 6328, 1837, 6853, 6051]\n",
      "满纸荒唐言,一把辛酸泪\n",
      "满纸荒唐言,一把辛酸泪\n"
     ]
    }
   ],
   "source": [
    "import sentencepiece as spm\n",
    "spm.SentencePieceTrainer.train('--input=../../data/红楼梦.txt --model_prefix=meng --vocab_size=10000 --model_type=bpe')\n",
    "sp_bpe = spm.SentencePieceProcessor()\n",
    "sp_bpe.load('meng.model')\n",
    "\n",
    "print('*** BPE ***')\n",
    "print(sp_bpe.encode_as_pieces('满纸荒唐言，一把辛酸泪'))\n",
    "print(sp_bpe.nbest_encode_as_pieces('满纸荒唐言', 5))  # returns an empty list.\n",
    "\n",
    "# encode: text => id\n",
    "print(sp_bpe.encode_as_pieces('满纸荒唐言，一把辛酸泪！都云作者痴，谁解其中味？'))\n",
    "print(sp_bpe.encode_as_ids('满纸荒唐言，一把辛酸泪！都云作者痴，谁解其中味？'))\n",
    "\n",
    "# decode: id => text\n",
    "print(sp_bpe.decode_pieces(['满', '纸', '荒', '唐', '言', ',', '一把', '辛', '酸', '泪']))\n",
    "print(sp_bpe.decode_ids([6060, 6407, 6679, 7346, 7398, 6260, 6014, 983, 7319, 7195, 6353]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
